Online Anomaly Detection of Industrial IoT Based on Hybrid Machine Learning Architecture

Industrial IoT (IIoT) in Industry 4.0 integrates information technology with operational technology and aims to improve Business to Business (B2B) services (from production to public services). It includes Machine to Machine (M2M) interaction either for process control (e.g., factory processes, fleet tracking) or as part of self-organizing cyber-physical distributed control systems without human intervention. A critical factor in achieving this is the development of intelligent software systems for automatic control of the business environment, able to analyze the existing equipment in real time through the available interfaces (hardware-in-the-loop). In this spirit, this paper presents an advanced intelligent approach to real-time monitoring of the operation of industrial equipment. A novel hybrid methodology is used that combines memory neural networks with Bayesian methods, which examine a variety of characteristic quantities of vibration signals extracted in the time domain, with the aim of detecting abnormalities in active IIoT equipment in real time.


Introduction
The industry sector within Industry 4.0 introduces and uses the Internet of Things in all its functions, which has contributed to significant innovative leaps [1,2]. Each part of the industrial ecosystem that participates in the production process is accompanied by a massive volume of generated data, which describe its regular operation while also containing frequent anomalies related to everyday or improper use [3]. The ability to detect abnormalities in the equipment process in real time has a significant impact on the overall operation of the industrial environment while offering stability and reliability during its management [4].
This capability, especially when attempted without human supervision, is complex as it depends on many factors. Initially, the key to this process is to determine the exact boundary between normal and abnormal operation, which presents the most significant difficulty in this analysis [5]. To select the appropriate anomaly detection technique, several factors must be weighed. The most important is the nature of the data produced, i.e., binary, continuous in time, or discrete values, and their relationship. The availability of data and the type of deviant behavior should also be specified. More specifically, it should be determined whether the behavior is a point anomaly, a pattern anomaly, or an anomaly only under certain conditions [6].
Point anomaly refers to detecting a point whose value differs from the rest of the data set. This type of anomaly applies to data sets whose values have a specific range and, in regular operation, do not exceed its maximum or minimum. Therefore, a sharp increase or decrease in a value that exceeds these limits can be described as abnormal behavior of the equipment or operation. Such anomalies are best detected before processing and analyzing the data set [7].
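As a minimal sketch of the range check described above, the following Python fragment flags readings that leave a fixed operating band. The function name, the sample readings, and the band limits are hypothetical and chosen only for illustration.

```python
def find_point_anomalies(values, lower, upper):
    """Return (index, value) pairs whose value falls outside [lower, upper]."""
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical readings and operating band, for illustration only.
readings = [0.4, 0.5, 0.6, 9.3, 0.5, -2.0, 0.4]
print(find_point_anomalies(readings, lower=0.0, upper=7.0))
```

Because the check needs only the current value and the two limits, it can run before any further processing of the data set.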
Another category concerns data that present a specific and repetitive pattern over time, including fluctuations in their values. Any deviation from this pattern can be considered an anomaly [8].
Finally, some data behaviors are considered anomalies only under certain conditions. A typical example is a bottleneck in an industrial network, which is regarded as normal behavior during industrial working hours, such as in the morning or at noon, but is not normal late at night and constitutes an anomaly in network traffic flow. So, in this type of anomaly, both the data values and the conditions that characterize them are essential [9,10]. The industrial systems that are usually involved in the production process of a unit show a gradual deviation from regular operation rather than instantaneous failure. This makes it possible to predict deviant behavior, as there is a time interval between faults. As is easily understood, the development of methods to detect anomalies is significant: they reduce the chances of an unexpected failure of the industrial system, which can have huge costs [11].
All the abovementioned reasons led to the orientation of many research works on methods based on data analysis to detect abnormalities in the operation of industrial equipment. Based on the literature, detecting anomalies in a function can be done with statistical methods, but machine learning algorithms are also of particular interest.

Literature Review
The research community is continuously trying to implement machine learning technologies to take advantage of their unique characteristics related to anomaly detection, especially in environments that tend to produce extremely high amounts of data, like IIoT [1]. This section presents some recent studies in this field, some of which propose the relatively new Federated Learning framework.
Shah et al. [3] in 2018 investigated various machine learning algorithms for detecting anomalies in IIoT data from motor equipment. They analyzed the sensor data on fuel consumption, engine load, and oil pressure to determine when a certain engine exhibits anomalous behavior and may fail. To discover aberrations in machine behavior, they used multivariate linear regression, Gaussian mixture models, and time-series data analysis. To conclude their research, they employed simple statistical analysis to answer some of these scenarios, while machine learning algorithms were applied in others.
Zhou et al. [4] offered an overview of existing network anomaly detection algorithms as well as a brief description of the requirements and obstacles in IIoT network security. They also presented alternative anomaly detection approaches specifically suitable to IIoT networks. Such methods take advantage of the physical world's deterministic properties to find anomalies in observed behavior. The approaches based on specification descriptions and physical process modeling consider the operating dynamics of the underlying physical system and discover abnormalities from essential system characteristics.
These techniques can be used in conjunction with cyber security techniques that detect anomalies caused by data modification. Finally, they proposed that for IIoT network anomaly detection, integrated cyber security and physical state estimation techniques would be more successful.
Four popular SCADA IIoT protocols, as well as their security flaws, were described by Zolanvari et al. [12]. Following that, they conducted a risk audit of the most significant and common security issues in IIoT systems and how machine learning-based solutions could assist in mitigating them. They showed a use case that included a real-world testbed constructed to perform cyber-attacks and designed an intrusion detection system. They used real-world cyber-attacks against this system to demonstrate how machine learning could address the identified gap by effectively handling them and measured the performance using representative measures to provide a fair assessment of the methods' effectiveness. Finally, feature priority ranking was investigated to emphasize the most important characteristics in separating malicious from normal traffic.
To fight against malicious actors, Yan et al. [13] presented a Hinge classification algorithm based on mini-batch gradient descent with an adaptive learning rate and momentum (HCA-MBGDALRM) in 2020. They stated that their method outperformed established approaches in terms of scalability and speed for deep network training. They also fixed the data skew issue during the shuffle phase. They developed a parallel framework for HCA-MBGDALRM to speed up the analysis of large traffic data volumes and enable IIoT safety improvements. Finally, they found that their technique increased the model's training efficiency and accuracy, ensuring the reliability of big data networking in IIoT.
Liu et al. [14] proposed the federated learning (FL) framework, which allows decentralized edge devices to collaborate to train a deep anomaly detection (DAD) model, improving its generalization capabilities. They also developed a convolutional neural network long short-term memory (CNN-LSTM) model for detecting anomalies. They used the convolutional units to capture fine-grained characteristics, while maintaining the LSTM unit's benefits in forecasting time series data. The findings confirmed that their approach could maintain appropriate precision across all data sets and that the technique could enhance information flow 300 times without making any mistakes by reducing gradients. They also proposed a gradient compression mechanism to lower communication expenses and increase communication efficiency in order to accomplish real-time and lightweight anomaly detection.
Finally, Wang et al. [15] in 2021 suggested an anomaly detection system for IIoT, which utilized federated learning to improve confidentiality for various IIoT applications. They used this method to create a universal anomaly detection concept, training each local model using the deep reinforcement learning technique without aggregating local data sets to protect confidentiality. Their solution had the advantage of not requiring local data sets during federated learning, which decreased the risk of privacy compromise. The federated deep reinforcement learning algorithm then adequately identified anomalous users. The proposed technique demonstrated positive outcomes in diverse IIoT scenarios, according to the validation studies. The proposed approach of this work aims at detecting signs of equipment behavior change in real time to identify anomalies before the collapse of a system, which may be due to damage or cyber-attacks [16].

Proposed Methodology
The primary idea of the proposed methodology is based on the logic of the P-F curve, which gives a representative picture of the behavior of equipment both in regular operation and in operation that presents abnormalities [8]. Therefore, the aim is to predict the point at which the behavior of the equipment begins to change (point P) much earlier than it would be perceived, from visible indications, by the person in charge of operating the equipment. Technological innovations in sensors contribute to this goal by monitoring the production process and recording the data necessary for the subsequent analysis by the proposed hybrid machine learning system [7].
More specifically, the proposed approach contains three basic algorithms. The first implements a long short-term memory (LSTM) neural network [17], the next implements the time-domain feature extraction process, and the last is Bayesian online changepoint detection [18]. Initially, the measurements recorded in real time by the sensors are taken as input by the LSTM neural network, which generates the predicted values of the same quantities over a horizon predetermined by the network configuration. These values then feed the algorithm that extracts the features needed for the upcoming data analysis. It is also important to note that the data taken as input to this process are not the entire data of each quantity but are selected from a fixed-size window that moves over time [19]. Finally, the signal output by the feature extraction stage feeds the change point detection algorithm, which indicates through a graph the probability that a point is a change point. At the same time, through the probabilities that it calculates, it predicts the future state of the equipment over the horizon determined by the neural network. This procedure is followed because the change point detection algorithm is sensitive to noise, which should be removed from the signal before it is passed in, so that the input carries information at a higher level than the original data and the uncertainty it initially includes is reduced [20].

Long Short-Term Memory Neural Network.
LSTM neural networks can retain information from a significant number of past time steps, providing satisfactory results in problems with sequential data and especially time series. An LSTM is a chain of similar neural units, each of which consists of four interacting layers.
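To make the four interacting layers concrete, here is a minimal sketch of a single cell update in plain Python, written for a scalar input and a scalar hidden state. It follows the standard LSTM gate formulation, whose steps are described in the text that follows. The function name, the `params` layout, and the zero weights in the usage lines are illustrative assumptions, not the configuration used in this work.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_cell_step(x_t, h_prev, c_prev, params):
    """One update of a scalar LSTM cell (illustrative sketch).

    params maps each layer name ('f', 'i', 'g', 'o') to a tuple
    (w_h, w_x, b): weight on h_prev, weight on x_t, and bias.
    """
    def affine(name):
        w_h, w_x, b = params[name]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(affine("f"))      # forget gate
    i_t = sigmoid(affine("i"))      # input gate
    g_t = math.tanh(affine("g"))    # candidate memory (tanh layer)
    o_t = sigmoid(affine("o"))      # output gate
    c_t = f_t * c_prev + i_t * g_t  # new memory state
    h_t = o_t * math.tanh(c_t)      # new cell output
    return h_t, c_t


# Usage with hypothetical all-zero weights: every gate evaluates to 0.5.
params = {name: (0.0, 0.0, 0.0) for name in ("f", "i", "g", "o")}
h, c = lstm_cell_step(x_t=0.3, h_prev=0.0, c_prev=1.0, params=params)
```

A practical network would use vector-valued states and learned weight matrices; the scalar form is kept only so that each gate is visible on one line.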
The composition and function of the basic structures of an LSTM cell can be described in the following steps [17,21,22], where $\sigma$ is the logistic sigmoid and $W$ and $b$ denote the weights and bias of each layer.
The output of the last cell and the current input to the cell are combined in one vector, which contains all the data to be processed:
$$[h_{t-1} + x_t]$$
The above vector goes through the "forget gate," and the following function is generated:
$$f_t = \sigma(W_f \cdot [h_{t-1} + x_t] + b_f)$$
This function is multiplied by the previous memory state so that we obtain the following:
$$f_t * C_{t-1}$$
The vector $[h_{t-1} + x_t]$ passes through the "input gate," and the following function is generated:
$$i_t = \sigma(W_i \cdot [h_{t-1} + x_t] + b_i)$$
The same vector passes through the tanh function and produces the following:
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1} + x_t] + b_C)$$
The results of the previous two steps are multiplied by each other to obtain the following:
$$i_t * \tilde{C}_t$$
Adding the results produced by the multiplications in the above steps gives the new memory state:
$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$
The vector $[h_{t-1} + x_t]$ passes through the "output gate," and the following function is generated:
$$o_t = \sigma(W_o \cdot [h_{t-1} + x_t] + b_o)$$
The new memory state is passed through a tanh function, and the result is multiplied by the above function, producing the new cell output:
$$h_t = o_t * \tanh(C_t)$$

Feature Extraction.
The data collected by sensors usually require pre-treatment to remove the noise contained in the signal and minimize the uncertainty it causes. In this way, information is produced with sufficient accuracy for the subsequent data analysis to give reliable results. Especially for vibration signals, such as those mainly produced in industry, the extraction of characteristic quantities is necessary when the data analysis involves the detection of faults and the prediction of various quantities [23]. The characteristic quantities are extracted in the time domain. They include the mean, which is the sum of the signal values divided by their number, and the root mean square (RMS), which is the square root of the mean of the squared signal values.
This quantity increases gradually as a fault develops within the signal under study, but it cannot provide information at the initial stage of fault development [20,24,25]. In addition, variance is a quantity that indicates the scatter of the signal around the mean, while standard deviation (std) is the square root of the signal variance. More specific quantities, capable of delivering more information, are kurtosis and skewness, which describe the probability density function of the signal. More specifically, kurtosis measures how sharply peaked the function is and indicates whether the signal can respond immediately to a change. Under normal conditions, the kurtosis of a vibration signal is approximately equal to a reference value, e.g., three. If there are faults in it, the probability density function changes, and therefore the value of the kurtosis is greater than that of regular operation. Accordingly, skewness measures the asymmetry of the probability density function about its mean and is used to indicate whether the vibration signal is negatively or positively skewed. In a signal with normal distribution, the skewness is zero. If the signal is disturbed due to faults, the skewness takes either a negative or a positive value depending on the direction of the asymmetry. In addition, it is worth noting that these two quantities can be applied to signals that are not purely stationary, in contrast to characteristics such as mean and standard deviation [26,27].
Another quantity that can be deduced from the probability density function for vibration signals is entropy, which is calculated from the histogram of the above function and indicates the magnitude of the randomness and uncertainty of the signal. Finally, the lower and upper bound histograms belong to the same category and give the minimum and maximum values of the probability density function, respectively [18,28]. The abovementioned are the appropriate characteristics for processing vibration signals coming from the industrial sector. Ordinarily, each of these attributes yields a single value calculated from all of the data studied. In this approach, however, the feature extraction process is rolling, and the result obtained is a curve with the values calculated at each step. In other words, this process uses a window of a specific size so that the features are calculated not from the whole data set but from a subset of fixed size that moves over time. Therefore, the analysis of the extracted characteristics includes values from previous times and the current one. The amount of these historical data is determined by the window set for calculating each attribute. The use of the window, and therefore of the historical data, results in the extraction of information at a higher level and in greater computational efficiency [7,24,29]. This work used sliding windows, and the exact window size for each feature was determined after testing to achieve the best result in terms of the information that can be read from the feature plot.
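The rolling extraction of the time-domain features can be sketched as follows. This is an illustrative Python fragment under stated assumptions: the function names are hypothetical, the moments use population (1/N) normalization, and kurtosis is the non-excess form, so that a normally distributed signal gives a value near three, as mentioned above. The exact normalization used in this work is not specified.

```python
import math


def window_features(window):
    """Mean, RMS, std, skewness, and kurtosis of one window (1/N moments)."""
    n = len(window)
    mean = sum(window) / n
    rms = math.sqrt(sum(v * v for v in window) / n)
    var = sum((v - mean) ** 2 for v in window) / n
    std = math.sqrt(var)
    if std == 0.0:
        # Skewness and kurtosis are undefined for a constant window.
        skew = kurt = float("nan")
    else:
        skew = sum((v - mean) ** 3 for v in window) / (n * std ** 3)
        kurt = sum((v - mean) ** 4 for v in window) / (n * std ** 4)
    return {"mean": mean, "rms": rms, "std": std, "skewness": skew, "kurtosis": kurt}


def rolling_features(signal, width):
    """Apply window_features to every full window of the given width (step 1)."""
    return [window_features(signal[i:i + width])
            for i in range(len(signal) - width + 1)]


# Usage on a short hypothetical signal: three windows of width 3.
curves = rolling_features([1.0, 2.0, 3.0, 4.0, 5.0], width=3)
```

Each feature then becomes a curve with one value per window position, which is the form the change point detection stage consumes.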

Bayesian Online Changepoint Detection.
A time series is a collection of observations in chronological order. These data are large, so they take up more memory, are multidimensional, and are constantly updated. Another characteristic is abrupt changes in their structure, such as a jump to a value much higher than the previous one or a different behavior in the data distribution.
The points at which the behavior changes are called change points, and they essentially split the data into homogeneous parts. Detecting abnormalities in a state of operation is a process in which abrupt changes in sequential data are identified, either in real time or afterward. Most algorithms with "Bayesian" logic focus on segmenting the data set and on techniques that produce results from its subsequent analysis. The algorithm used in this study instead focuses on causal prediction: during execution, it generates a distribution of the next value in the data sequence, taking into account only the values that have been recorded so far. This approach is suitable for detecting change points in time series because it makes it easy to quantify the probability that a position is a change point [10,21,30]. The quantities to be studied are a series of time-ordered observations divided into dissimilar, non-overlapping segments of a specific length, and the boundaries between them are the change points. It is also assumed that, within each segment $\rho$, the observations are independent and identically distributed random variables with probability distribution $P(x_t \mid \eta_\rho)$, where the parameters $\eta_\rho$ are themselves independent and identically distributed random variables. In addition, a group of observations between time instants $a$ and $b$ is denoted by $x_{a:b}$, and the prior probability distribution over the interval between two change points by $P_{gap}(g)$.
This approach estimates the posterior probability distribution over the current data length $r_t$ at time $t$. The data length $r_t$ is a time-dependent function that is reset to zero when a state change occurs, i.e., when a change point is encountered, and it refers to the data from the most recent change point up to that time. In addition, the observations associated with a data length $r_t$ are denoted by $x_t^{(r)}$; if the data length is zero ($r = 0$), the set $x^{(r)}$ is empty.
To predict the distribution at the current data length, the posterior distribution must first be computed recursively and combined with the marginal predictive distribution through the following formulas [24,31,32]:
$$P(x_{t+1} \mid x_{1:t}) = \sum_{r_t} P(x_{t+1} \mid r_t, x_t^{(r)}) \, P(r_t \mid x_{1:t})$$
$$P(r_t \mid x_{1:t}) = \frac{P(r_t, x_{1:t})}{P(x_{1:t})}$$
$$P(r_t, x_{1:t}) = \sum_{r_{t-1}} P(r_t, r_{t-1}, x_{1:t})$$
The joint distribution $P(r_t, x_{1:t})$ is computed recursively over the current data length in order to obtain the posterior distribution. Finally, it is worth noting that the predictive distribution depends only on the recent data $x_t^{(r)}$. Therefore, the joint distribution can be computed recursively for the current data length $r_t$, given $r_{t-1}$, and the predictive distribution follows from the newly observed value, given the values observed so far.
To calculate the abovementioned quantities and formulate the recursive formulas, it is necessary to define the boundary conditions based on two considerations. In the first case, a change point has occurred before the first value of the data to be studied, and therefore the initial data length is zero with probability one. In the second case, on the other hand, the study is done on a recent subset of data, and the boundary condition is formed by the normalized survival function, which indicates the time at the end of which one or more events occur. The conditions in mathematical form are shown below [25,33,34]:
$$P(r_0 = 0) = 1$$
$$P(r_0 = \tau) = \frac{1}{Z} S(\tau)$$
where $Z$ is the normalizing constant and
$$S(\tau) = \sum_{t=\tau+1}^{\infty} P_{gap}(g = t).$$
The computational efficiency of the algorithm is due to the form of the conditional probability of a change point given the previous data length. This function is zero everywhere except in two cases: when the data length increases by one as a new value is added, and when a change point is observed and the data length resets to zero. The function is shown below [18], [35]:
$$P(r_t \mid r_{t-1}) = \begin{cases} H(r_{t-1} + 1) & \text{if } r_t = 0 \\ 1 - H(r_{t-1} + 1) & \text{if } r_t = r_{t-1} + 1 \\ 0 & \text{otherwise} \end{cases}$$
where $H(\tau)$ is the hazard function, which represents which pieces of data have a higher or lower probability of an event occurring, and is equal to
$$H(\tau) = \frac{P_{gap}(g = \tau)}{\sum_{t=\tau}^{\infty} P_{gap}(g = t)}.$$
Exponential-family models are a handy tool for detecting anomalies and, essentially, for the algorithm described in this work. They are easy to use because they offer a set of parametric probability distributions and statistical quantities that can be calculated during data collection. The form of the probability under these models is reported for completeness and is shown in the following equations [36], [37], where $\eta$ are the natural parameters, $U(x)$ the sufficient statistics, and $A(\eta)$ the log-partition function:
$$P(x \mid \eta) = h(x) \exp\left(\eta^{\top} U(x) - A(\eta)\right)$$
$$A(\eta) = \log \int h(x) \exp\left(\eta^{\top} U(x)\right) dx$$
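As a small numerical illustration of the hazard function, taken here in its usual discrete form (the probability of a gap of exactly τ divided by the probability of a gap of at least τ), the sketch below evaluates it for a geometric gap distribution. The geometric choice, the function names, and the truncation horizon are assumptions made for the example; for this distribution the hazard is constant and equal to 1/λ, which is why it is a common default.

```python
def geometric_gap_pmf(t, lam):
    """P_gap(g = t) for t = 1, 2, ... with mean gap length lam."""
    p = 1.0 / lam
    return (1.0 - p) ** (t - 1) * p


def hazard(pmf, tau, horizon=10_000):
    """H(tau) = P_gap(g = tau) / sum_{t >= tau} P_gap(g = t).

    The infinite sum is truncated after `horizon` terms, which is
    adequate when the tail of the gap distribution decays quickly.
    """
    tail = sum(pmf(t) for t in range(tau, tau + horizon))
    return pmf(tau) / tail


# For a geometric gap with lam = 100 the hazard does not depend on tau.
h5 = hazard(lambda t: geometric_gap_pmf(t, 100.0), tau=5)
h50 = hazard(lambda t: geometric_gap_pmf(t, 100.0), tau=50)
```

Other gap distributions give a hazard that varies with τ, which makes a change point more or less likely as the current segment grows.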

Data Set, Scenarios, and Results
Data from an industrial plant that performs cold rolling of metals were used for the present study. This process substantially reduces the thickness of the metal to the desired smaller thickness with a perfectly smooth surface, or reshapes it, by passing it between two rollers that rotate in the opposite direction from that of the metal. For this process, the metal temperature must be below the temperature at which the metal recrystallizes. Ten sensors have been installed in the cold rolling equipment to collect vibration data for the experiments. In the cold rolling unit, there is also a sensor that measures the speed of the motor and another that measures its current. In this study, we are only interested in the data collected by the vibration sensors. In more detail, these sensors record values for four different vibration variables every ten seconds. These are the acceleration, the state of the rollers in terms of resistance and vibration (overall bearing), the abrupt change of state (shock), and the speed (velocity). The measurements cover ten months, and the ten sensors produced ten files, each containing between 870,000 and 990,000 records.
Each of the created files contains six columns. The first refers to the name of the roller, depending on the position of the sensor on it. The second column holds the date and time as a UTC timestamp, so that it is expressed globally and not locally, and the other four columns hold the values of the quantities produced by the sensors; these data are time series. From the statistical analysis of the data, it initially appears that the quantity "shock" describes whether or not there is a significant disturbance in the operation of the device. The absence of a disturbance is represented by a zero value and its presence by a positive value. The acceleration takes only positive values, with a minimum close to zero and a maximum that does not exceed 7. The second quantity, overall bearing, behaves similarly, with a maximum that does not exceed 8. Finally, velocity has a minimum close to 0, but its maximum is much higher than those of the two previous quantities. An indicative representation of the quantities in question is presented in Figure 1. The diagrams show the data for each of the quantities recorded by the sensor. In the acceleration plot, we observe one value that differs from the rest, yet the variation of the other values remains visible, as there is no considerable difference between the maximum and the rest. In the case of velocity, on the other hand, the fluctuation of the values is not visible: they all appear as a straight line very close to zero, interrupted by a vertical line that represents the maximum value. If we plot only some values before this maximum, as shown in the velocity figure on the right, we notice that the velocity data do vary and are not zero. The high velocity value probably comes from a disturbance in the sensor environment; it cannot be interpreted physically as being related to the behavior of the equipment [9,29,33].
The experimental process aims to detect anomalies in the industrial data. The diagrammatic representation of the experiment is shown in Figure 2. Acceleration, overall bearing, and velocity values are predicted via the LSTM neural network during the experimental process. In the intermediate stage of the process, the characteristics are extracted from the vibration signals (raw data) in the time domain. The procedure concerns the mean, standard deviation, RMS, kurtosis, and skewness [17,20,21,27], which are used to detect differences between vibration signals. Then, the feature extraction algorithm is applied, producing five rolling features. Specifically, let τ0 ∈ T be the time of submission of the relevant query; then the scope of a rolling window with width ω and step δ for each τ ∈ T (with τ ≥ τ0) is defined by a three-branch function [19,30], where the magnitudes τ0, τ ∈ T are expressed as timestamps and ω, δ ∈ N are expressed as numbers of time units (ω, δ > 0). For the sake of simplicity, this definition implies that all time quantities are expressed as natural numbers so that the function is calculated at distinct time points of T. The window multiples then result from the relation given in [19,38]. Usually, step δ is the same size as the unit of time (e.g., one second) so that the window's progress stays in line with the corresponding time. Because δ < ω generally holds, the contents of two consecutive snapshots of the sliding window overlap. Meanwhile, its contents remain unchanged until the function is applied again at the next step, after δ time. This is expressed by the recursive expression in the middle branch of the function: the edges of the window change only at times specified by step δ. The third branch of the function accounts for the initial "missing" windows immediately after the query, when the range exceeds the time span of the current contents.
Since the scope function is monotonic, the evolution of time implies a corresponding progression of the intervals, and the function can be defined even for future moments. All subsequent elements of the stream are covered, regardless of when and if they finally appear. For example, suppose the kurtosis attribute is extracted with window = 25. In that case, the first value of this attribute will be extracted from the data in positions 1 to 25, the second value from the data in positions 2 to 26, and so on. The term position here means the order in which the data from the sensor have been recorded.
Finally, for each time τ ∈ T, the window join operator returns the pairs of tuples that match across the respective window snapshots [19]. It is essentially a sliding-window join process, applied separately to each stream. Each new input tuple of stream S1 is checked against join condition E with all existing contents of the rolling window W2 of stream S2. If matching elements are found, the exported results receive the most recent timestamp appearing in the original tuple pair.
This timestamp is, after all, what restores the order of the final results in the generated data stream.
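The window = 25 example can be checked with a few lines of Python. The helper below is a hypothetical illustration that lists the 1-based positions covered by each full window of a given width and step.

```python
def window_positions(n, width, step=1):
    """Yield 1-based (first, last) positions of each full window over n samples."""
    for first in range(1, n - width + 2, step):
        yield (first, first + width - 1)


# With 27 recorded values and window = 25, three windows fit.
print(list(window_positions(27, 25)))  # [(1, 25), (2, 26), (3, 27)]
```

With a step smaller than the width, consecutive windows share most of their contents, which is the overlap noted in the text.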
The Bayesian online changepoint detection algorithm [20] is then applied to each of the features generated for each quantity studied, to detect a change in equipment operation through feature analysis. This process can be captured in the following steps [17,20,39], where $\pi_t^{(r)}$ is the predictive probability of the new value under data length $r$, $\nu$ and $\chi$ are the hyperparameters of the conjugate exponential-family prior, and $u(x)$ is the sufficient statistic:
Step 1. Initialization through the following boundary conditions:
$$P(r_0 = 0) = 1 \quad \text{or} \quad P(r_0 = \tau) = \frac{1}{Z} S(\tau), \qquad \nu_1^{(0)} = \nu_{prior}, \quad \chi_1^{(0)} = \chi_{prior}$$
Step 2. Observation of the next data value $x_t$.
Step 3. Predictive probability estimation as follows:
$$\pi_t^{(r)} = P(x_t \mid \nu_t^{(r)}, \chi_t^{(r)})$$
Step 4. Probability calculation as the current data length increases, as shown below:
$$P(r_t = r_{t-1} + 1, x_{1:t}) = P(r_{t-1}, x_{1:t-1}) \, \pi_t^{(r)} \, \left(1 - H(r_{t-1} + 1)\right)$$
Step 5. Calculation of the probability for the existence of a change point as follows:
$$P(r_t = 0, x_{1:t}) = \sum_{r_{t-1}} P(r_{t-1}, x_{1:t-1}) \, \pi_t^{(r)} \, H(r_{t-1} + 1)$$
Step 6. Probability calculation for the observation group, from the first to time $t$:
$$P(x_{1:t}) = \sum_{r_t} P(r_t, x_{1:t})$$
Step 7. The data length distribution is defined as follows:
$$P(r_t \mid x_{1:t}) = \frac{P(r_t, x_{1:t})}{P(x_{1:t})}$$
Step 8. Update of the parameter values as shown below:
$$\nu_{t+1}^{(0)} = \nu_{prior}, \quad \chi_{t+1}^{(0)} = \chi_{prior}, \quad \nu_{t+1}^{(r+1)} = \nu_t^{(r)} + 1, \quad \chi_{t+1}^{(r+1)} = \chi_t^{(r)} + u(x_t)$$
Step 9. Calculation of the marginal predictive distribution:
$$P(x_{t+1} \mid x_{1:t}) = \sum_{r_t} P(x_{t+1} \mid x_t^{(r)}, r_t) \, P(r_t \mid x_{1:t})$$
Step 10. Return to Step 2 to observe the next value.
The time and space complexity of the algorithm is linear in the amount of data processed, and it is calculated at each time point based on the data observed so far. The above description gives a clear picture of the operation and purpose of the algorithm, which calculates the quantities required to obtain the probability that each point is a change point (log-likelihood). In particular, the algorithm accepts as input the data in which anomalies are to be detected and first outputs, for each time point of the data set, the probability that it is a change point. These values are used to create the diagram that presents this information graphically. More specifically, the calculated probability values are log-likelihoods, which are logarithmic and negative. In the resulting diagram, the vertical axis is on a logarithmic scale, and the horizontal one represents time.
Another quantity that results from applying the algorithm is the position of the point most likely to be a change point. Although the generated curve may have several points with a probability high enough to indicate a state change, the algorithm returns the one with the highest probability, i.e., the point where the curve is at its highest position on the chart.
This approach returns the time at which a behavior change is most likely. Finally, it returns the mean of the values at all time points before the most probable change point. Similarly, the mean of the values of the quantity studied after that time is calculated.
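The recursion behind these steps can be sketched in a few dozen lines of Python. This is a simplified illustration under explicit assumptions, not the configuration used in the experiments: observations are modeled as Gaussian with known variance `obs_var` and a Normal prior (`mu0`, `var0`) on the segment mean, the hazard is constant, and all names and default values are hypothetical. The distribution over data lengths is renormalized at every step, which leaves the recursion unchanged up to scale and avoids numerical underflow.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def bocpd_map_run_lengths(data, hazard=0.01, mu0=0.0, var0=10.0, obs_var=1.0):
    """Return the most probable data length (run length) after each observation."""
    probs = [1.0]        # initialization: data length 0 with probability one
    means = [mu0]        # posterior mean of the segment mean, per data length
    variances = [var0]   # posterior variance of the segment mean, per data length
    map_runs = []
    for x in data:       # observe the next value
        # Predictive probability of x under each candidate data length.
        pred = [normal_pdf(x, m, v + obs_var) for m, v in zip(means, variances)]
        # Growth: the data length increases by one.
        growth = [p * pi * (1.0 - hazard) for p, pi in zip(probs, pred)]
        # Change point: the data length resets to zero.
        change = sum(p * pi * hazard for p, pi in zip(probs, pred))
        joint = [change] + growth
        # Evidence, then the normalized data length distribution.
        evidence = sum(joint)
        probs = [p / evidence for p in joint]
        # Conjugate update of the parameters for every data length.
        new_means, new_vars = [mu0], [var0]
        for m, v in zip(means, variances):
            post_var = 1.0 / (1.0 / v + 1.0 / obs_var)
            new_means.append(post_var * (m / v + x / obs_var))
            new_vars.append(post_var)
        means, variances = new_means, new_vars
        map_runs.append(max(range(len(probs)), key=probs.__getitem__))
    return map_runs


# Hypothetical feature curve with a level shift after 30 samples.
signal = [0.0] * 30 + [5.0] * 30
runs = bocpd_map_run_lengths(signal)
change_index = len(signal) - runs[-1]  # start of the most recent segment
```

With a constant hazard, the posterior probability of a data length of exactly zero equals the hazard at every step, so this sketch reports the most probable data length instead; a sudden drop in that value marks a change. The means of the samples before and after `change_index` then correspond to the averages mentioned above.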
The results of the above experiment are shown in the following diagrams. The first row of the figures shows the predicted values of the quantities of interest compared to the actual values of these data. The behavior of each feature is different, even though the input is the same. Of course, there are several similarities among the attributes mean, std, and rms. Therefore, the result of the algorithm shows several common points for these features. In addition, all the diagrams showing the result of the experiment give a clear picture of the anomaly detection process, as the peaks of the curves are distinct.
Below are the results for some data sets to which the procedure of this experiment was applied. In each figure, the first column presents the initial data of each quantity (acceleration, overall bearing, and velocity) for a data set. The following columns show the algorithm's result for each of the features kurtosis, skewness, mean, std, and rms. These curves are the result to be evaluated in the experiment. The results record the time that the algorithm considers to be the change point in the operation of the equipment. The value forecast was made with a horizon of half an hour.
Finally, an important observation is that applying the algorithm to the features mean, standard deviation, and rms gives better results than applying it to kurtosis and skewness. While these three features exhibit similar behavior, they differ for each data set, which can be seen from the time each identifies as the most likely for abnormalities to occur. In addition, the mean feature detects the malfunction of the equipment earlier, as it presents an anomaly at an earlier time than the other features for all the quantities studied.

Conclusions
The detection and timely assessment of abnormalities in the operation of the industrial ecosystem allows the detection of incidents and the corresponding identification of correlations and causal relationships with security incidents, which can significantly mitigate the effects of sophisticated cyberattacks. In this spirit, a hybrid system of deep machine learning architecture was used to predict anomalies in the operation of industrial equipment. Specifically, the Bayesian online changepoint detection algorithm was used to detect anomalies, the time-domain feature extraction process to extract features, and the long short-term memory neural network to predict the values of the quantities that reflect whether or not the equipment is operating correctly. The data used are recorded by sensors in real time. Because they contain a lot of noise, detecting anomalies without prior pre-treatment produces great uncertainty. If the detection is done after proper pre-processing of the data, the results are distinguished with great accuracy.
Future developments of the work concern the development of more complex deep learning architectures to model the problem in question more fully. Procedures should also be studied to add other parameters as input and improve the model's accuracy and efficiency. At the same time, predictive analysis procedures for new methods that will fully automate the extraction of data characteristics and the detection of anomalies should be studied.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.